Allow parallel execution of NAS backup and delete commands#12847
Allow parallel execution of NAS backup and delete commands#12847jmsperu wants to merge 2 commits into
Conversation
Change executeInSequence() to return false for TakeBackupCommand and DeleteBackupCommand, allowing the KVM agent to process multiple backup/delete operations concurrently via its worker thread pool. Previously, all backup commands were serialized — a large VM backup (e.g. 100+ GB taking 2+ hours) would block all other backup and delete operations on the same host. Since each backup mounts its own temporary NFS directory and operates on independent VM disks, there is no shared state requiring serialization. Restore and PrepareForBackupRestoration commands remain sequential as they modify VM state that should not be concurrent. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 4.20 #12847 +/- ##
============================================
+ Coverage 16.24% 16.26% +0.01%
- Complexity 13411 13432 +21
============================================
Files 5664 5665 +1
Lines 500463 500572 +109
Branches 60779 60792 +13
============================================
+ Hits 81308 81420 +112
+ Misses 410059 410046 -13
- Partials 9096 9106 +10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR changes CloudStack backup agent command scheduling by making the NAS backup take and delete commands eligible for parallel execution on the agent (instead of being forced through the single-threaded “in-sequence” queue).
Changes:
- Make
TakeBackupCommand.executeInSequence()returnfalse. - Make
DeleteBackupCommand.executeInSequence()returnfalse.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| core/src/main/java/org/apache/cloudstack/backup/TakeBackupCommand.java | Marks backup creation command as non-sequential (parallelizable) on the agent. |
| core/src/main/java/org/apache/cloudstack/backup/DeleteBackupCommand.java | Marks backup deletion command as non-sequential (parallelizable) on the agent. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| @Override | ||
| public boolean executeInSequence() { | ||
| return true; | ||
| return false; |
| @Override | ||
| public boolean executeInSequence() { | ||
| return true; | ||
| return false; |
|
@JoaoJandre could you have a look at whether this change would affect KBOSS? |
@winterhazel KBOSS uses its own commands to send messages to the agent, thus this does not affect it. |
|
@jmsperu could you have a look at @abh1sar's comment at #12847 (comment)? |
|
Thanks for the ping @winterhazel — and @abh1sar's point at #12847 (comment) is well taken. DeleteBackup needs sequential execution because parallel deletes against the same NAS path can race on the cleanup of the chain (we've also seen the rm -rf vs mount/unmount interleave produce stuck mounts). Will restrict the parallelization to TakeBackup only in the next push — DeleteBackup keeps its current sequential behaviour. Thanks @JoaoJandre for the KBOSS confirmation too. |
Revert DeleteBackupCommand to executeInSequence()=true per @abh1sar's review. Backup creation is safe to parallelize across VMs because each VM writes to its own NAS path. Backup deletion is not safe to parallelize the same way: incremental chains share parent qcow2 files on the NAS, and concurrent deletions of sibling or descendant nodes can race on the underlying file references and the chain metadata. Until that's modeled explicitly, keep DeleteBackup serialized. TakeBackup parallelization stands.
|
@blueorangutan package |
|
@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Notes on the current red checks — all three appear unrelated to this change (which is a single
Happy to rebase to retrigger if a reviewer prefers a fresh CI run, but the change itself is unaffected by these matrices. Other 14 checks green, 8 still running. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17997 |
Summary
executeInSequence()fromtruetofalseinTakeBackupCommandandDeleteBackupCommandRestoreBackupCommandandPrepareForBackupRestorationCommandremain sequential (they modify VM state)Motivation
Currently all backup commands are serialized on the agent — a large VM backup (e.g. 100+ GB taking 2+ hours) blocks all other backup and delete operations on the same host. This is the root cause of backup schedule delays and timeouts in environments with many VMs per host.
Each backup operation:
mktemp -d -t csbackup.XXXXX)There is no technical reason to serialize them. The agent already has a thread pool (
requestHandler) that can execute multiple commands concurrently — this change simply allows backup commands to use it.Impact
Test plan
virsh domjobinfoon multiple VMs)-bbandwidth throttle if NFS saturates